Effect of Architectures and Training Methods on the Performance of Learned Video Frame Prediction
We analyze the performance of feedforward vs. recurrent neural network (RNN)
architectures and associated training methods for learned frame prediction. To
this end, we trained a residual fully convolutional neural network (FCNN), a
convolutional RNN (CRNN), and a convolutional long short-term memory (CLSTM)
network for next frame prediction using the mean square loss. We performed both
stateless and stateful training for recurrent networks. Experimental results
show that the residual FCNN architecture performs the best in terms of peak
signal-to-noise ratio (PSNR) at the expense of higher training and test
(inference) computational complexity. The CRNN can be trained stably and very
efficiently using the stateful truncated backpropagation through time
procedure, and it requires an order of magnitude less inference runtime to
achieve near real-time frame prediction with acceptable performance. Comment: Accepted for publication at IEEE ICIP 201
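The abstract above ranks the architectures by PSNR after training with a mean square loss. As a minimal sketch of how the two quantities relate, assuming 8-bit frames (peak value 255) stored as nested lists, PSNR can be computed from the mean squared error. The function names `mse` and `psnr` are illustrative and are not taken from the paper.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized frames (lists of rows)."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, peak=255.0):
    """PSNR in dB; peak=255 is an assumption that the frames are 8-bit."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / err)


if __name__ == "__main__":
    predicted = [[100, 101], [102, 103]]
    target = [[101, 102], [103, 104]]
    print(f"PSNR: {psnr(predicted, target):.2f} dB")
```

Because PSNR is a monotonically decreasing function of MSE, minimizing the mean square loss during training directly maximizes PSNR on the training data.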
MMSR: Multiple-Model Learned Image Super-Resolution Benefiting From Class-Specific Image Priors
Assuming a known degradation model, the performance of a learned image
super-resolution (SR) model depends on how well the variety of image
characteristics within the training set matches those in the test set. As a
result, the performance of an SR model varies noticeably from image to image
over a test set depending on whether characteristics of specific images are
similar to those in the training set or not. Hence, in general, a single SR
model cannot generalize well enough for all types of image content. In this
work, we show that training multiple SR models for different classes of images
(e.g., for text, texture, etc.) to exploit class-specific image priors and
employing a post-processing network that learns how to best fuse the outputs
produced by these multiple SR models surpasses the performance of
state-of-the-art generic SR models. Experimental results clearly demonstrate
that the proposed multiple-model SR (MMSR) approach significantly outperforms a
single pre-trained state-of-the-art SR model both quantitatively and visually.
It even exceeds the performance of the best single class-specific SR model
trained on similar text or texture images. Comment: 5 pages, 4 figures, accepted for publication in IEEE ICIP 2022
Conference
Multi-Scale Deformable Alignment and Content-Adaptive Inference for Flexible-Rate Bi-Directional Video Compression
The inability to adapt the motion compensation model to video content
is an important limitation of current end-to-end learned video compression
models. This paper advances the state-of-the-art by proposing an adaptive
motion-compensation model for end-to-end rate-distortion optimized hierarchical
bi-directional video compression. In particular, we propose two novelties: i) a
multi-scale deformable alignment scheme at the feature level combined with
multi-scale conditional coding, ii) motion-content adaptive inference. In
addition, we employ a gain unit, which enables a single model to operate at
multiple rate-distortion operating points. We also exploit the gain unit to
control bit allocation among intra-coded vs. bi-directionally coded frames by
fine tuning corresponding models for truly flexible-rate learned video coding.
Experimental results demonstrate state-of-the-art rate-distortion performance
exceeding that of all prior art in learned video coding. Comment: Accepted for publication in IEEE International Conference on Image
Processing (ICIP) 202
Perception-Distortion Trade-off in the SR Space Spanned by Flow Models
Flow-based generative super-resolution (SR) models learn to produce a diverse
set of feasible SR solutions, called the SR space. Diversity of SR solutions
increases with the temperature (τ) of latent variables, which introduces
random variations of texture among sample solutions, resulting in visual
artifacts and low fidelity. In this paper, we present a simple but effective
image ensembling/fusion approach to obtain a single SR image eliminating random
artifacts and improving fidelity without significantly compromising perceptual
quality. We achieve this by benefiting from a diverse set of feasible
photo-realistic solutions in the SR space spanned by flow models. We propose
different image ensembling and fusion strategies which offer multiple paths to
move sample solutions in the SR space to more desired destinations in the
perception-distortion plane in a controllable manner depending on the fidelity
vs. perceptual quality requirements of the task at hand. Experimental results
demonstrate that our image ensembling/fusion strategy achieves a more promising
perception-distortion trade-off than sample SR images produced by flow
models and adversarially trained models in terms of both quantitative metrics
and visual quality. Comment: 5 pages, 4 figures, accepted for publication in IEEE ICIP 2022
Conference
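The abstract above does not spell out its ensembling strategies, so the following is only a hypothetical illustration of the simplest one: a pixel-wise mean over several sampled SR images. It shows why averaging can improve fidelity when random texture variations differ between samples. The names `ensemble_mean` and `mse`, and the toy pixel values, are assumptions made for this sketch.

```python
def ensemble_mean(samples):
    """Pixel-wise average of equally sized images (each a list of rows)."""
    n = len(samples)
    height = len(samples[0])
    width = len(samples[0][0])
    return [
        [sum(s[i][j] for s in samples) / n for j in range(width)]
        for i in range(height)
    ]


def mse(image_a, image_b):
    """Mean squared error, used here as a simple fidelity (distortion) measure."""
    diffs = [
        (x - y) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for x, y in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy data: two samples that deviate from the reference in opposite directions.
    reference = [[10.0, 20.0]]
    samples = [[[8.0, 22.0]], [[12.0, 18.0]]]
    fused = ensemble_mean(samples)
    print("sample MSE:", mse(samples[0], reference))
    print("fused MSE:", mse(fused, reference))
```

A plain mean tends to blur texture, which is one reason the choice of fusion strategy trades distortion against perceptual quality rather than improving both at once.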
Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, borne out in two case studies.
Focal-Plane Change Triggered Video Compression for Low-Power Vision Sensor Systems
Video sensors with embedded compression offer significant energy savings in transmission but incur energy losses in the complexity of the encoder. Energy-efficient video compression architectures for CMOS image sensors with focal-plane change detection are presented and analyzed. The compression architectures use pixel-level computational circuits to minimize energy usage by selectively processing only pixels that generate significant temporal intensity changes. Using the temporal intensity change detection to gate the operation of a differential DCT-based encoder achieves nearly identical image quality to traditional systems (4 dB decrease in PSNR) while reducing the amount of data that is processed by 67% and reducing overall power consumption by 51%. These typical energy savings, resulting from the sparsity of motion activity in the visual scene, demonstrate the utility of focal-plane change triggered compression to surveillance vision systems.
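The gating idea in the abstract above can be sketched in software, although the paper implements it in pixel-level circuits. The sketch below assumes a per-pixel absolute-difference test against a stored reference frame. The threshold value and the function names `change_mask` and `gated_update` are hypothetical.

```python
def change_mask(reference, current, threshold):
    """True where the temporal intensity change exceeds the threshold."""
    return [
        [abs(c - r) > threshold for r, c in zip(ref_row, cur_row)]
        for ref_row, cur_row in zip(reference, current)
    ]


def gated_update(reference, current, threshold):
    """Update only changed pixels; return the new reference, the mask, and
    the fraction of pixels that would be passed on to the encoder."""
    mask = change_mask(reference, current, threshold)
    updated = [
        [c if m else r for r, c, m in zip(ref_row, cur_row, mask_row)]
        for ref_row, cur_row, mask_row in zip(reference, current, mask)
    ]
    total = sum(len(row) for row in mask)
    active = sum(sum(row) for row in mask)  # True counts as 1
    return updated, mask, active / total


if __name__ == "__main__":
    reference = [[10, 10], [10, 10]]
    current = [[10, 30], [11, 10]]
    updated, mask, fraction = gated_update(reference, current, threshold=5)
    print(updated, fraction)
```

In a mostly static scene, only a small fraction of pixels passes the test, which is the sparsity the abstract credits for the reduction in processed data.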